Add durable control-plane state storage#179
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a ControlPlaneStore to manage short-lived state such as admin sessions, OAuth tokens, and rate limits using MongoDB, replacing several in-memory implementations. API keys are enhanced with support for scopes, expiration, and organization/project bindings, including validation logic in the request dependencies. The update also enforces durable storage requirements for production environments and includes comprehensive unit tests for the new functionality. I have no feedback to provide.
|
| Filename | Overview |
|---|---|
| src/database/control_plane_store.py | New MongoDB-backed store for single-use tokens, admin sessions, and rate limits; single-use token path is correctly atomic via find_one_and_update, but the rate-limit path's find_one + update_one is non-atomic and can be defeated under concurrency. |
| src/database/api_key_store.py | Adds scopes, expiry, org/project bindings, expiry enforcement, and production fail-fast; the write-failure fallback incorrectly drops the new fields when recursing to in-memory storage. |
| src/api/dependencies.py | Wires org/project context enforcement via _api_key_matches_request_context and replaces the in-process rate limiter with the durable store; enforcement logic is correct but scopes stored on keys are not checked here. |
| src/api/routes/auth.py | Migrates MCP temp tokens and OAuth auth codes to the durable store; validation and single-use semantics are correct, but the local in-memory mirrors (_mcp_temp_tokens, _oauth_auth_codes) are now dead weight with a fragile expires_at lookup. |
| src/api/routes/admin.py | Replaces in-process _admin_sessions dict with durable store; session creation, lookup, and deletion all look correct. |
| src/api/routes/api_keys.py | Extends CRUD endpoints to expose and persist scopes, expiry, org/project bindings; plumbing is correct but the scopes field has no enforcement in the auth layer. |
| tests/unit/test_control_plane_store.py | Covers single-use token single-use semantics, admin session lifecycle, and rate-limit shape; the async rate-limit test lacks a pytest.mark.asyncio decorator but the rest looks solid. |
| tests/unit/test_database_stores.py | Adds coverage for scoped key creation, expiry validation, and production fail-fast on write errors. |
| tests/api/test_dependencies_and_routes.py | New test for bound API key org/project context enforcement; covers both the rejection and acceptance paths. |
Sequence Diagram
sequenceDiagram
participant Client
participant AuthRoute as auth.py
participant ControlPlaneStore
participant MongoDB
participant APIKeyStore
Note over AuthRoute,MongoDB: MCP Temp Token Flow
Client->>AuthRoute: POST /auth/mcp-token (JWT)
AuthRoute->>ControlPlaneStore: "create_single_use_token(mcp_temp_token, user_id, ttl=10m)"
ControlPlaneStore->>MongoDB: insert_one(record with token_hash)
MongoDB-->>ControlPlaneStore: ok
ControlPlaneStore-->>AuthRoute: "{token, expires_at}"
AuthRoute-->>Client: MCPTempTokenResponse
Client->>AuthRoute: POST /auth/mcp-exchange (temp_token)
AuthRoute->>ControlPlaneStore: consume_single_use_token(mcp_temp_token, token)
ControlPlaneStore->>MongoDB: "find_one_and_update(exchanged=False to True)"
MongoDB-->>ControlPlaneStore: record (atomic)
ControlPlaneStore-->>AuthRoute: user_id
AuthRoute->>APIKeyStore: create_api_key(user_id)
APIKeyStore-->>AuthRoute: "{key, scopes, org_id}"
AuthRoute-->>Client: MCPExchangeResponse(api_key)
Note over Client,MongoDB: API Request with Bound Key
Client->>AuthRoute: GET /... Bearer xmem_xxx + X-Org-Id header
AuthRoute->>APIKeyStore: validate_api_key(token)
APIKeyStore-->>AuthRoute: "key_doc {org_id, project_id, scopes}"
AuthRoute->>AuthRoute: _api_key_matches_request_context(key_doc, request)
alt context mismatch
AuthRoute-->>Client: 403 Forbidden
else context ok
AuthRoute->>ControlPlaneStore: check_rate_limit(identity, max, window)
ControlPlaneStore->>MongoDB: find_one + update_one (non-atomic)
MongoDB-->>ControlPlaneStore: allowed/remaining
ControlPlaneStore-->>AuthRoute: (True, remaining)
AuthRoute-->>Client: 200 OK
end
Comments Outside Diff (1)
-
src/api/routes/api_keys.py, line 199-202 (link)scopesfield is stored and returned but never enforcedAPI keys can now be created with a
scopeslist (e.g.,["memory:read"]), but nothing in the auth pipeline reads that list to gate which operations the key may perform. A key scoped to["memory:read"]can still call write endpoints. If scopes are intended as a future addition, a comment noting the enforcement gap would help; if they are meant to be active now, a scope-check step needs to be added (e.g., in_api_key_matches_request_contextor in a dedicatedrequire_scopedependency).
Reviews (1): Last reviewed commit: "Add durable control-plane state" | Re-trigger Greptile
| self._in_memory = True | ||
| return self.create_api_key(user_id, name) |
There was a problem hiding this comment.
When MongoDB
insert_one fails and the code falls back to in-memory storage, the recursive call passes only user_id and name, silently dropping scopes, expires_at, org_id, and project_id. Any caller who created a scoped or expiring key will get back an unscoped, non-expiring key with no error, breaking the constraint the caller intended to enforce.
| self._in_memory = True | |
| return self.create_api_key(user_id, name) | |
| self._in_memory = True | |
| return self.create_api_key( | |
| user_id, | |
| name, | |
| scopes=scopes, | |
| expires_at=expires_at, | |
| org_id=org_id, | |
| project_id=project_id, | |
| ) |
| record = self.rate_limits.find_one({"identity": identity}) or {} | ||
| hits = [hit for hit in record.get("hits", []) if hit > cutoff] | ||
| if len(hits) >= max_requests: | ||
| self.rate_limits.update_one( | ||
| {"identity": identity}, | ||
| { | ||
| "$set": { | ||
| "hits": hits, | ||
| "expires_at": _now() + timedelta(seconds=window_seconds), | ||
| } | ||
| }, | ||
| upsert=True, | ||
| ) | ||
| return False, 0 | ||
|
|
||
| hits.append(now) | ||
| self.rate_limits.update_one( | ||
| {"identity": identity}, | ||
| { | ||
| "$set": { | ||
| "hits": hits, | ||
| "expires_at": _now() + timedelta(seconds=window_seconds), | ||
| } | ||
| }, | ||
| upsert=True, | ||
| ) | ||
| return True, max(max_requests - len(hits), 0) |
There was a problem hiding this comment.
Non-atomic read-modify-write in MongoDB rate limiter
check_rate_limit on the MongoDB path does a find_one then a separate update_one with $set. Two concurrent requests can both read the same hits array (say, [t1, t2], length 2 with max_requests=3), both see they are under the limit, both append their own timestamp, and both call update_one({$set: {hits: [t1, t2, now]}}). The second write overwrites the first, so the counter effectively only advances by 1 instead of 2 — allowing a burst of concurrent calls to exceed the configured limit. The old asyncio.Lock() path was safe for a single process; the durable path needs atomic operations (e.g., $push / $size) or findOneAndUpdate with an aggregation pipeline to avoid this window.
| _mcp_temp_tokens[created["token"]] = {"expires_at": created["expires_at"]} | ||
| return created["token"] |
There was a problem hiding this comment.
Redundant in-memory dicts with fragile access pattern
_mcp_temp_tokens (and _oauth_auth_codes) now hold only a mirror of the expires_at returned by the store — they are never consulted for validation (the pop before consume_single_use_token is a no-op for correctness). The access _mcp_temp_tokens[temp_token]["expires_at"] on line 443 works only because both the store call and this access happen in the same synchronous request, but it is fragile: any future refactor that separates creation from the response lookup, or any multi-worker scenario where the store call and the response are on different workers, would produce a KeyError. The expires_at value is already available from created["expires_at"] returned by the store; use it directly and remove the dict entirely.
| class ControlPlaneStore: | ||
| """MongoDB-backed store for short-lived control-plane state.""" | ||
|
|
||
| def __init__(self, uri: str = None, database: str = None) -> None: | ||
| self._uri = uri or settings.mongodb_uri | ||
| self._database = database or settings.mongodb_database | ||
| self._client = None | ||
| self._db = None | ||
| self.records = None | ||
| self.rate_limits = None | ||
| self._connected = False | ||
| self._in_memory = False | ||
| self._try_connect() | ||
|
|
There was a problem hiding this comment.
Three independent
ControlPlaneStore instances, three connection pools
ControlPlaneStore() is instantiated at module level in auth.py, admin.py, and dependencies.py. Each instance calls MongoClient(...) in _try_connect, which opens its own connection pool. These instances are never shared, so the application holds three separate pools to the same database. Consider sharing a single instance (e.g., via a module-level singleton or a FastAPI app.state attribute) to avoid unnecessary resource overhead.
Summary
Verification
uv run --extra dev python -m pytest tests/unit/test_control_plane_store.py tests/unit/test_database_stores.py tests/api/test_dependencies_and_routes.pyuv run --extra dev ruff check src/database/control_plane_store.py src/database/api_key_store.py src/api/dependencies.py src/api/routes/auth.py src/api/routes/admin.py src/api/routes/api_keys.py tests/unit/test_control_plane_store.py tests/unit/test_database_stores.py tests/api/test_dependencies_and_routes.py/claim #161